{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `create_report()`: generate profile reports from a pandas DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of `create_report` is to generate profile reports from a pandas DataFrame. `create_report` utilizes the functionalities and formats the plots from `dataprep`. It provides the following information:\n",
    "\n",
    "1. Overview: detect the types of columns in a dataframe\n",
    "2. Variables: variable type, unique values, distint count, missing values\n",
    "3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range\n",
    "4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness\n",
    "5. Text analysis for length, sample and letter\n",
    "6. Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices\n",
    "7. Missing Values: bar chart, heatmap and spectrum of missing values\n",
    "\n",
    "In the following, we break down the report into different sections to demonstrate each part of the report."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we load the `titanic` dataset into a pandas dataframe and use it to demonstrate our functionality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.datasets import load_dataset\n",
    "df = load_dataset(\"titanic\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate report via `create_report(df)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After getting a dataset, we could generate the report object by calling `create_report(df)`. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.eda import create_report\n",
    "report = create_report(df, title='My Report')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have a report object, we can show it in the notebook, or open the report in browser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# display(report)\n",
    "report.show_browser()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or just save the report to local:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "report.save('report.html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "You can see the full report [here](../../_static/images/create_report/titanic_dp.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Enable/Disable sections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computing all the sections is time consuming. You can enable/disable sections using the following two approaches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Just show a few sections by setting display argument. E.g., run the following code to show only the overview section and the variables section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "report = create_report(df, display=['Overview', 'Variables'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Just disable a few sections by setting enable to False. E.g., run the following code to disable interactions section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "report = create_report(df, config={'interactions.enable': False})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we can see the types of columns and the statistics of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "create_report(df, display=['Overview'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variables section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we can see the statistics and plots for each of variable in the dataset.\n",
    "\n",
    "For numerical variable, the report shows quantile statistics, descriptive statistics, histogram, KDE plot, QQ norm plot and box plot.\n",
    "\n",
    "For categorical variable, the report shows text analysis, bar chart, pie chart, word cloud, word frequencies and word length.\n",
    "\n",
    "For datetime variable, the report shows line chart.\n",
    "\n",
    "Furthermore, the variables being displayed can also be sorted based on a number of criteria."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_report(df, display=['Variables'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interactions section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, the report will show an interactive plot, user can use the dropdown menu above the plot to select which two variables user wants to compare. \n",
    "\n",
    "By default, it show the scatter plot for all numerical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_report(df, display=['Interactions'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also enable categorical variables by setting `interactions.cat_enable` to True. It will add categorical-categorical and categorical-numerical interactions:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlations section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we can see the correlations between variables in Spearman, Pearson and Kendall matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_report(df, display=['Correlations'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing Values section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we can see the missing values in the dataset through bar chart, spectrum and heatmap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_report(df, display=['Missing Values'])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
